Authorship Identification in Large Email Collections: Experiments Using Features that Belong to Different Linguistic Levels - Notebook for PAN at CLEF 2011
Authors
Abstract
The aim of this paper is to explore the usefulness of features from different linguistic levels for email authorship identification. Using various email datasets provided by the PAN’11 lab, we tested several feature groups in both the authorship attribution and authorship verification subtasks. The selected feature groups, combined with Regularized Logistic Regression and One-Class SVM machine learning methods, performed well above average in the authorship attribution subtasks and below average in the authorship verification subtasks.
Similar resources
A Multitude of Linguistically-rich Features for Authorship Attribution - Notebook for PAN at CLEF 2011
This paper reports on the procedure and learning models we adopted for the ‘PAN 2011 Author Identification’ challenge targeting real-world email messages. The novelty of our approach lies in a design which combines shallow characteristics of the emails (word and trigram frequencies) with a large number of ad hoc linguistically-rich features addressing different language levels. For the autho...
EPSMS and the Document Occurrence Representation for Authorship Identification - Notebook for PAN at CLEF 2011
This paper describes the participation of the PISIS team in the authorship identification track of PAN’11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mi...
Authorship Identification of E-mail as a Multi-Class Task - Notebook for PAN at CLEF 2011
In this paper, we describe a multi-class text categorization approach to authorship attribution and test it on sets of e-mail collections. The PAN 2011 competition data consists of e-mails of variable length, written by various candidate authors, with some represented by significantly longer or more e-mails than others. Rather than construct a classifier for each separate author to discriminate...
Vote/Veto Meta-Classifier for Authorship Identification - Notebook for PAN at CLEF 2011
For the PAN 2011 authorship identification challenge we have developed a system based on a meta-classifier which selectively uses the results of multiple base classifiers. In addition we also performed feature engineering based on the given domain of e-mails. We present our system as well as results on the evaluation dataset. Our system performed second and third best in the authorship attribut...
Authorship Identification with Modality Specific Meta Features - Notebook for PAN at CLEF 2011
This paper presents the approach used in the PAN ’11 authorship identification competition. Our method extracts meta features from several independently generated clustering solutions from the training set. Each clustering solution uses a disjoint set of features that represent a specific linguistic modality. The different clustering solutions encode similarities in writing styles of authors ac...